The 2019-2020 Coronavirus Pandemic Analysis

Contact: Smith Research

BACKGROUND & APPROACH

I wanted to track and trend the coronavirus outbreak out of my own curiosity. This is a historic moment, scientifically and analytically: a large amount of data is being shared across the globe and analyzed in real time, and some interesting questions may fall out of it. The world has come to a halt because of it.
This analysis attempts to answer the following questions (more to come):

  1. What does the trend of the pandemic look like to date?
  2. What are future case predictions based on a model fit to historical data?
  3. What interesting quirks or patterns emerge?

ASSUMPTIONS & LIMITATIONS: * This data is limited by the source. I realized early on that the number of cases conflicted depending on the source. Originally I was using JHU data, but it was always ‘ahead’ of Our World In Data. I also noticed that JHU’s website was buggy: clicking on the U.S. stats didn’t actually reflect the U.S. So I changed data sources to be more consistent with what is presented in the media (and Our World In Data has more extensive plots to compare my own against). An interesting aside might be: why the discrepancy? Was I missing something?
* Definitions are important, as is the idea that multiple variables accumulate into totals like case counts (more testing, for example).

SOURCE RAW DATA: * https://ourworldindata.org/coronavirus
* https://github.com/CSSEGISandData/COVID-19/

INPUT DATA LOCATION: github (https://github.com/sbs87/coronavirus/tree/master/data)

OUTPUT DATA LOCATION: github (https://github.com/sbs87/coronavirus/tree/master/results)

TIMESTAMP

Start: ##—— Fri May 29 23:09:30 2020 ——##

PRE-ANALYSIS

The following sections are outside the scope of the analysis proper but are still needed to prepare everything.

UPSTREAM PROCESSING/ANALYSIS

  1. Google Mobility scraping, script available at get_google_mobility.py
# Mobility data has to be extracted from Google PDF reports using a web-scraping script (Python, written by Peter Simone, https://github.com/petersim1/MIT_COVID19)

# See get_google_mobility.py for local script 

python3 get_google_mobility.py
# writes csv file of mobility data as "mobility.csv"

SET UP ENVIRONMENT

Load libraries and set global variables

# timestamp start
timestamp()
## ##------ Fri May 29 23:09:30 2020 ------##

# clear previous environment
rm(list = ls())

##------------------------------------------
## LIBRARIES
##------------------------------------------
library(plyr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0     ✓ purrr   0.3.3
## ✓ tibble  3.0.0     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::arrange()   masks plyr::arrange()
## x purrr::compact()   masks plyr::compact()
## x dplyr::count()     masks plyr::count()
## x dplyr::failwith()  masks plyr::failwith()
## x dplyr::filter()    masks stats::filter()
## x dplyr::id()        masks plyr::id()
## x dplyr::lag()       masks stats::lag()
## x dplyr::mutate()    masks plyr::mutate()
## x dplyr::rename()    masks plyr::rename()
## x dplyr::summarise() masks plyr::summarise()
## x dplyr::summarize() masks plyr::summarize()
library(ggplot2)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(plot.utils)
library(utils)
library(knitr)

##------------------------------------------

##------------------------------------------
# GLOBAL VARIABLES
##------------------------------------------
user_name <- Sys.info()["user"]
working_dir <- paste0("/Users/", user_name, "/Projects/coronavirus/")  # don't forget trailing /
results_dir <- paste0(working_dir, "results/")  # assumes directory exists
results_dir_custom <- paste0(results_dir, "custom/")  # assumes directory exists


Corona_Cases.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
Corona_Cases.US.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv"
Corona_Deaths.US.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv"
Corona_Deaths.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv"

Corona_Cases.fn <- paste0(working_dir, "data/", basename(Corona_Cases.source_url))
Corona_Cases.US.fn <- paste0(working_dir, "data/", basename(Corona_Cases.US.source_url))
Corona_Deaths.fn <- paste0(working_dir, "data/", basename(Corona_Deaths.source_url))
Corona_Deaths.US.fn <- paste0(working_dir, "data/", basename(Corona_Deaths.US.source_url))
default_theme <- theme_bw() + theme(text = element_text(size = 14))  # fix this
##------------------------------------------

FUNCTIONS

List of functions

function_name      description
prediction_model   outputs case estimates for given log-linear model parameters (slope and intercept)
make_long          converts input data to long format (specialized cases)
name_overlaps      outputs the column-name intersection and set differences of two data frames
find_linear_index  finds the first date at which linearity occurs
##------------------------------------------
## FUNCTION: prediction_model
##------------------------------------------
## --- //// ----
# Takes days-vs-log10(cases) linear model parameters and a set of days since the 100th case, and outputs a data frame with the total number of predicted cases for those days
## --- //// ----
prediction_model<-function(m=1,b=0,days=1){
  total_cases.log<-m*days+b       # the model was fit on log10(cases)
  total_cases<-10^total_cases.log # back-transform to case counts
  prediction<-data.frame(days=days,Total_confirmed_cases_perstate=total_cases)
  return(prediction)
}
##------------------------------------------
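A quick sanity check of prediction_model; the slope and intercept here are illustrative placeholders, not fitted values:

```r
# Illustrative parameters only; in the analysis, m and b come from a fitted
# days-vs-log10(cases) linear model
prediction_model(m = 0.1, b = 2, days = 0:3)
```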

##------------------------------------------
## FUNCTION: make_long
##------------------------------------------
## --- //// ----
# Takes wide-format case data and converts it into long format, using date and total cases as the variable/value columns. Also enforces standardization (and assumes the data structure's naming) via fixed variable name, value name, and id.vars
## --- //// ----
make_long<-function(data_in,variable.name = "Date",
                   value.name = "Total_confirmed_cases",
                   id.vars=c("case_type","Province.State","Country.Region","Lat","Long","City","Population")){

long_data<-melt(data_in,
                id.vars = id.vars,
                variable.name=variable.name,
                value.name=value.name)
return(long_data)

}
##------------------------------------------
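A minimal illustration of make_long on a toy wide-format frame (toy data; only the column names match the real JHU layout):

```r
# Toy frame with the fixed id.vars make_long assumes, plus two date columns
toy_wide <- data.frame(case_type = "total", Province.State = "Pennsylvania",
                       Country.Region = "US_state", Lat = 40.3, Long = -75.1,
                       City = "Bucks", Population = NA,
                       X5.28.20 = 10, X5.29.20 = 12)
make_long(toy_wide)  # two rows: one per date column, values in Total_confirmed_cases
```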

## THIS WILL BE IN UTILS AT SOME POINT
name_overlaps<-function(df1,df2){
  i<-intersect(names(df1),names(df2))
  sd1<-setdiff(names(df1),names(df2))
  sd2<-setdiff(names(df2),names(df1))
  cat("intersection:\n",paste(i,"\n"))
  cat("in df1 but not df2:\n",paste(sd1,"\n"))
  cat("in df2 but not df1:\n",paste(sd2,"\n"))
  return(list("int"=i,"sd_1_2"=sd1,"sd_2_1"=sd2))
}

##------------------------------------------
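A small example of name_overlaps on two toy frames with partially overlapping columns:

```r
df_a <- data.frame(Date = 1, total = 2, Lat = 3)
df_b <- data.frame(Date = 1, death = 2, Lat = 3)
ov <- name_overlaps(df_a, df_b)  # prints the intersection and both set diffs
ov$int  # "Date" "Lat"
```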

##------------------------------------------
## FUNCTION: find_linear_index
##------------------------------------------
## --- //// ----
# Find the date at which total case data becomes linear on the log scale (for a given data frame)
## --- //// ----

find_linear_index<-function(tmp,running_avg=5){
  tmp$Total_confirmed_cases_perstate.log<-log(tmp$Total_confirmed_cases_perstate,2)
  derivative<-data.frame(matrix(nrow = nrow(tmp),ncol = 4))
  names(derivative)<-c("m.time","mm.time","cumsum","date")
  
  # First derivative
  for(t in 2:nrow(tmp)){
    slope.t<- tmp[t,"Total_confirmed_cases_perstate.log"]- tmp[t-1,"Total_confirmed_cases_perstate.log"]
    derivative[t,"m.time"]<-slope.t
    derivative[t,"date"]<-as.Date(tmp[t,"Date"])
  }
  
  # Second derivative
  for(t in 2:nrow(derivative)){
    slope.t<- derivative[t,"m.time"]- derivative[t-1,"m.time"]
    derivative[t,"mm.time"]<-slope.t
  }
  
  #Count second-derivative values within 0.2 of zero over a trailing window of length running_avg
  for(t in running_avg:nrow(derivative)){
    slope.t<- sum(abs(derivative[t:(t-(running_avg-1)),"mm.time"])<0.2,na.rm = T)
    derivative[t,"cumsum"]<-slope.t
  }
  
  #Find the date running_avg days before the stability point (the first window that is entirely stable)
  linear_begin<-min(derivative[!is.na(derivative$cumsum) & derivative$cumsum==running_avg,"date"])-running_avg
  
  return(linear_begin)
}
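A toy check of find_linear_index: counts that grow perfectly exponentially are linear in log2 from the start, so the stability point should be found early (toy data; the return value is a numeric date, as in the function above):

```r
# Toy frame with the two columns find_linear_index expects
toy <- data.frame(
  Date = as.Date("2020-03-01") + 0:19,
  Total_confirmed_cases_perstate = round(100 * 2^(seq(0, 6, length.out = 20)))
)
find_linear_index(toy)
```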

READ IN DATA

# Q: do we want to archive previous versions? Maybe an auto git mv?

##------------------------------------------
## Download and read in latest data from github
##------------------------------------------
download.file(Corona_Cases.source_url, destfile = Corona_Cases.fn)
Corona_Totals.raw <- read.csv(Corona_Cases.fn, header = T, stringsAsFactors = F)

download.file(Corona_Cases.US.source_url, destfile = Corona_Cases.US.fn)
Corona_Totals.US.raw <- read.csv(Corona_Cases.US.fn, header = T, stringsAsFactors = F)

download.file(Corona_Deaths.source_url, destfile = Corona_Deaths.fn)
Corona_Deaths.raw <- read.csv(Corona_Deaths.fn, header = T, stringsAsFactors = F)

download.file(Corona_Deaths.US.source_url, destfile = Corona_Deaths.US.fn)
Corona_Deaths.US.raw <- read.csv(Corona_Deaths.US.fn, header = T, stringsAsFactors = F)

# latest date on all data:
paste("US deaths:", names(Corona_Deaths.US.raw)[ncol(Corona_Deaths.US.raw)])
## [1] "US deaths: X5.29.20"
paste("US total:", names(Corona_Totals.US.raw)[ncol(Corona_Totals.US.raw)])
## [1] "US total: X5.29.20"
paste("World deaths:", names(Corona_Deaths.raw)[ncol(Corona_Deaths.raw)])
## [1] "World deaths: X5.29.20"
paste("World total:", names(Corona_Totals.raw)[ncol(Corona_Totals.raw)])
## [1] "World total: X5.29.20"

PROCESS DATA

  • Convert to long format
  • Fix date formatting/convert to numeric date
  • Log10 transform total # cases
##------------------------------------------
## Combine death and total data frames
##------------------------------------------
Corona_Totals.raw$case_type<-"total"
Corona_Totals.US.raw$case_type<-"total"
Corona_Deaths.raw$case_type<-"death"
Corona_Deaths.US.raw$case_type<-"death"

# For some reason, Population is listed in the US deaths file but not in the other datasets. When combining, all datasets will have this column, but US deaths is the only one with useful values.
Corona_Totals.US.raw$Population<-"NA" 
Corona_Totals.raw$Population<-"NA"
Corona_Deaths.raw$Population<-"NA"

Corona_Cases.raw<-rbind(Corona_Totals.raw,Corona_Deaths.raw)
Corona_Cases.US.raw<-rbind(Corona_Totals.US.raw,Corona_Deaths.US.raw)
#TODO: custom utils- setdiff, intersect names... option to output in merging too
##------------------------------------------
# prepare raw datasets for eventual combining
##------------------------------------------
Corona_Cases.raw$City<-"NA" # placeholder; only the US-level data has Cities
Corona_Cases.US.raw$Country_Region<-"US_state" # To differentiate from World-level stats

Corona_Cases.US.raw<-plyr::rename(Corona_Cases.US.raw,c("Province_State"="Province.State",
                                                  "Country_Region"="Country.Region",
                                                  "Long_"="Long",
                                                  "Admin2"="City"))


##------------------------------------------
## Convert to long format
##------------------------------------------
#JHU uses an awkward file format: wide, with each column a date in MM/DD/YY. Read this in as raw data but transform it to be better suited for analysis
# Furthermore, the world- and US-level data are formatted differently, containing different columns, etc. Rectify this and combine the world-level stats with the U.S.-level stats.

Corona_Cases.long<-rbind(make_long(select(Corona_Cases.US.raw,-c(UID,iso2,iso3,code3,FIPS,Combined_Key))),
make_long(Corona_Cases.raw))


##------------------------------------------
## Fix date formatting, convert to numeric date
##------------------------------------------
Corona_Cases.long$Date<-gsub(Corona_Cases.long$Date,pattern = "^X",replacement = "") # read.csv prefixes the MM.DD.YY column names with X
Corona_Cases.long$Date<-as.Date(Corona_Cases.long$Date,format = "%m.%d.%y") # %m and %y accept unpadded one- and two-digit values
Corona_Cases.long$Date.numeric<-as.numeric(Corona_Cases.long$Date)

kable(table(select(Corona_Cases.long,c("Country.Region","case_type"))),caption = "Number of death and total case longitudinal datapoints per geographical region")
Number of death and total case longitudinal datapoints per geographical region
death total
Afghanistan 129 129
Albania 129 129
Algeria 129 129
Andorra 129 129
Angola 129 129
Antigua and Barbuda 129 129
Argentina 129 129
Armenia 129 129
Australia 1032 1032
Austria 129 129
Azerbaijan 129 129
Bahamas 129 129
Bahrain 129 129
Bangladesh 129 129
Barbados 129 129
Belarus 129 129
Belgium 129 129
Belize 129 129
Benin 129 129
Bhutan 129 129
Bolivia 129 129
Bosnia and Herzegovina 129 129
Botswana 129 129
Brazil 129 129
Brunei 129 129
Bulgaria 129 129
Burkina Faso 129 129
Burma 129 129
Burundi 129 129
Cabo Verde 129 129
Cambodia 129 129
Cameroon 129 129
Canada 1806 1806
Central African Republic 129 129
Chad 129 129
Chile 129 129
China 4257 4257
Colombia 129 129
Comoros 129 129
Congo (Brazzaville) 129 129
Congo (Kinshasa) 129 129
Costa Rica 129 129
Cote d’Ivoire 129 129
Croatia 129 129
Cuba 129 129
Cyprus 129 129
Czechia 129 129
Denmark 387 387
Diamond Princess 129 129
Djibouti 129 129
Dominica 129 129
Dominican Republic 129 129
Ecuador 129 129
Egypt 129 129
El Salvador 129 129
Equatorial Guinea 129 129
Eritrea 129 129
Estonia 129 129
Eswatini 129 129
Ethiopia 129 129
Fiji 129 129
Finland 129 129
France 1419 1419
Gabon 129 129
Gambia 129 129
Georgia 129 129
Germany 129 129
Ghana 129 129
Greece 129 129
Grenada 129 129
Guatemala 129 129
Guinea 129 129
Guinea-Bissau 129 129
Guyana 129 129
Haiti 129 129
Holy See 129 129
Honduras 129 129
Hungary 129 129
Iceland 129 129
India 129 129
Indonesia 129 129
Iran 129 129
Iraq 129 129
Ireland 129 129
Israel 129 129
Italy 129 129
Jamaica 129 129
Japan 129 129
Jordan 129 129
Kazakhstan 129 129
Kenya 129 129
Korea, South 129 129
Kosovo 129 129
Kuwait 129 129
Kyrgyzstan 129 129
Laos 129 129
Latvia 129 129
Lebanon 129 129
Lesotho 129 129
Liberia 129 129
Libya 129 129
Liechtenstein 129 129
Lithuania 129 129
Luxembourg 129 129
Madagascar 129 129
Malawi 129 129
Malaysia 129 129
Maldives 129 129
Mali 129 129
Malta 129 129
Mauritania 129 129
Mauritius 129 129
Mexico 129 129
Moldova 129 129
Monaco 129 129
Mongolia 129 129
Montenegro 129 129
Morocco 129 129
Mozambique 129 129
MS Zaandam 129 129
Namibia 129 129
Nepal 129 129
Netherlands 645 645
New Zealand 129 129
Nicaragua 129 129
Niger 129 129
Nigeria 129 129
North Macedonia 129 129
Norway 129 129
Oman 129 129
Pakistan 129 129
Panama 129 129
Papua New Guinea 129 129
Paraguay 129 129
Peru 129 129
Philippines 129 129
Poland 129 129
Portugal 129 129
Qatar 129 129
Romania 129 129
Russia 129 129
Rwanda 129 129
Saint Kitts and Nevis 129 129
Saint Lucia 129 129
Saint Vincent and the Grenadines 129 129
San Marino 129 129
Sao Tome and Principe 129 129
Saudi Arabia 129 129
Senegal 129 129
Serbia 129 129
Seychelles 129 129
Sierra Leone 129 129
Singapore 129 129
Slovakia 129 129
Slovenia 129 129
Somalia 129 129
South Africa 129 129
South Sudan 129 129
Spain 129 129
Sri Lanka 129 129
Sudan 129 129
Suriname 129 129
Sweden 129 129
Switzerland 129 129
Syria 129 129
Taiwan* 129 129
Tajikistan 129 129
Tanzania 129 129
Thailand 129 129
Timor-Leste 129 129
Togo 129 129
Trinidad and Tobago 129 129
Tunisia 129 129
Turkey 129 129
Uganda 129 129
Ukraine 129 129
United Arab Emirates 129 129
United Kingdom 1419 1419
Uruguay 129 129
US 129 129
US_state 420669 420669
Uzbekistan 129 129
Venezuela 129 129
Vietnam 129 129
West Bank and Gaza 129 129
Western Sahara 129 129
Yemen 129 129
Zambia 129 129
Zimbabwe 129 129
# Decouple population and lat/long data, refactor to make it more tidy
metadata_columns<-c("Lat","Long","Population")
metadata<-unique(select(filter(Corona_Cases.long,case_type=="death"),c("Country.Region","Province.State","City",all_of(metadata_columns))))
Corona_Cases.long<-select(Corona_Cases.long,-all_of(metadata_columns))

# Some countries are broken out by province rather than summarized at the country level. Collapse all but the US state data
Corona_Cases.long<-rbind.fill(ddply(filter(Corona_Cases.long,!Country.Region=="US_state"),c("case_type","Country.Region","Date","Date.numeric"),summarise,Total_confirmed_cases=sum(Total_confirmed_cases)),filter(Corona_Cases.long,Country.Region=="US_state"))

# Put total case and deaths side-by-side (wide)
Corona_Cases<-spread(Corona_Cases.long,key = case_type,value = Total_confirmed_cases)

#Compute mortality rate
Corona_Cases$mortality_rate<-Corona_Cases$death/Corona_Cases$total

#TMP
Corona_Cases<-plyr::rename(Corona_Cases,c("total"="Total_confirmed_cases","death"="Total_confirmed_deaths"))

##------------------------------------------
## log10 transform total # cases
##------------------------------------------
Corona_Cases$Total_confirmed_cases.log<-log(Corona_Cases$Total_confirmed_cases,10)
Corona_Cases$Total_confirmed_deaths.log<-log(Corona_Cases$Total_confirmed_deaths,10)
##------------------------------------------
       
##------------------------------------------
## Compute # of days since 100th case for US data
##------------------------------------------

# Find the day the 100th case was recorded for each Country/Province. NOTE: Non-US countries may have odd provinces. For example, France is summarized at the country level but also has 3 provinces. I've only ensured the U.S. case100 works... so the case100_date for the U.S. is summarized both for the entire country (regardless of state) and on a per-state level.
# TODO: consider city-level summary as well. This data may be sparse

Corona_Cases<-merge(Corona_Cases,ddply(filter(Corona_Cases,Total_confirmed_cases>100),c("Country.Region"),summarise,case100_date=min(Date.numeric)))
Corona_Cases$Days_since_100<-Corona_Cases$Date.numeric-Corona_Cases$case100_date

##------------------------------------------
## Add population and lat/long data (CURRENTLY US ONLY)
##------------------------------------------

kable(filter(metadata,(is.na(Country.Region) | is.na(Population) )) %>% select(c("Country.Region","Province.State","City")) %>% unique(),caption = "Regions for which either population or Country is NA")
Regions for which either population or Country is NA
Country.Region Province.State City
# Drop missing data 
metadata<-filter(metadata,!(is.na(Country.Region) | is.na(Population) ))
# Convert remaining pop to numeric
metadata$Population<-as.numeric(metadata$Population)
## Warning: NAs introduced by coercion
# Add metadata to cases
Corona_Cases<-merge(Corona_Cases,metadata,all.x = T)

##------------------------------------------
## Compute total and death cases relative to population 
##------------------------------------------

Corona_Cases$Total_confirmed_cases.per100<-100*Corona_Cases$Total_confirmed_cases/Corona_Cases$Population
Corona_Cases$Total_confirmed_deaths.per100<-100*Corona_Cases$Total_confirmed_deaths/Corona_Cases$Population


##------------------------------------------
## Filter df for US state-wide stats
##------------------------------------------

Corona_Cases.US_state<-filter(Corona_Cases,Country.Region=="US_state" & Total_confirmed_cases>0 )
Corona_Cases.US_state[!is.na(Corona_Cases.US_state$Province.State) & Corona_Cases.US_state$Province.State=="Pennsylvania" & Corona_Cases.US_state$City=="Delaware","City"]<-"Delaware_PA"

kable(table(select(Corona_Cases.US_state,c("Province.State"))),caption = "Number of longitudinal datapoints (total/death) per state")
Number of longitudinal datapoints (total/death) per state
Var1 Freq
Alabama 4467
Alaska 839
Arizona 1121
Arkansas 4680
California 4275
Colorado 4067
Connecticut 664
Delaware 275
Diamond Princess 74
District of Columbia 75
Florida 4770
Georgia 10654
Grand Princess 75
Guam 75
Hawaii 387
Idaho 2129
Illinois 6115
Indiana 6193
Iowa 5682
Kansas 4707
Kentucky 6829
Louisiana 4475
Maine 1135
Maryland 1745
Massachusetts 1127
Michigan 5413
Minnesota 5100
Mississippi 5548
Missouri 6145
Montana 1910
Nebraska 3541
Nevada 849
New Hampshire 785
New Jersey 1685
New Mexico 1876
New York 4265
North Carolina 6536
North Dakota 2218
Northern Mariana Islands 60
Ohio 5829
Oklahoma 4415
Oregon 2243
Pennsylvania 4601
Puerto Rico 75
Rhode Island 442
South Carolina 3237
South Dakota 2813
Tennessee 6301
Texas 13236
Utah 1032
Vermont 1050
Virgin Islands 75
Virginia 8242
Washington 2917
West Virginia 3013
Wisconsin 4418
Wyoming 1366
Corona_Cases.US_state<-merge(Corona_Cases.US_state,ddply(filter(Corona_Cases.US_state,Total_confirmed_cases>100),c("Province.State"),summarise,case100_date_state=min(Date.numeric)))
Corona_Cases.US_state$Days_since_100_state<-Corona_Cases.US_state$Date.numeric-Corona_Cases.US_state$case100_date_state

ANALYSIS

Q1: What is the trend in cases and mortality across geographical regions?

Plot # of cases vs time
* For each geographical set:
* comparative longitudinal case trend (absolute & log scale)
* comparative longitudinal mortality trend
* death vs total correlation

question                                   dataset  x      y               color      facet              pch        dimensions
comparative longitudinal case trend (log)  long     time   log_cases       geography  none (case type?)  case_type  [15, 50, 4] geography x (2 scale?) case type
comparative longitudinal case trend        long     time   cases           geography  case_type          ?          [15, 50, 4] geography x (2+ scale) case type
comparative longitudinal mortality trend   wide     time   mortality rate  geography  none               none       [15, 50, 4] geography
death vs total correlation                 wide     cases  deaths          geography  none               none       [15, 50, 4] geography
# total cases vs time
# death cases vs time
# mortality rate vs time
# death vs mortality


  # death vs mortality
  # total & death case vs time (same plot)

#<question> <x> <y> <colored> <facet> <dataset>
## trend in cases/deaths over time, compared across regions <time> <log cases> <geography*> <none> <.wide>
## trend in cases/deaths over time, compared across regions <time> <cases> <geography*> <case_type> <.long>
## trend in mortality rate over time, compared across regions <time> <mortality rate> <geography*> <none>
## how are deaths/mortality related/correlated? <time> <log cases> <geography*> <none>
## how are death and case load correlated? <cases> <deaths>

# lm for each?? -> apply lm to each region starting from the 100th case. m, b associated with each.
    # input: geographical region, log cases vs day (since 100th case)
    # output: m, b for each geographical region ID



#total/death on the same plot differ by ~2 logs, so when plotting log scale, use pch; when plotting absolute, use free scales
#when plotting deaths and cases on the same plot, melt.

#Corona_Cases -> filter sets (3)
  #world - choose countries with sufficient data

N<-ddply(filter(Corona_Cases,Total_confirmed_cases>100),c("Country.Region"),summarise,n=length(Country.Region))
ggplot(filter(N,n<100),aes(x=n))+
  geom_histogram()+
  default_theme+
  ggtitle("Distribution of number of days with at least 100 confirmed cases for each region")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

kable(arrange(N,-n),caption="Sorted number of days with at least 100 confirmed cases")
Sorted number of days with at least 100 confirmed cases
Country.Region n
US_state 41698
China 129
Diamond Princess 110
Korea, South 100
Japan 99
Italy 97
Iran 94
Singapore 91
France 90
Germany 90
Spain 89
US 88
Switzerland 86
United Kingdom 86
Belgium 85
Netherlands 85
Norway 85
Sweden 85
Austria 83
Malaysia 82
Australia 81
Bahrain 81
Denmark 81
Canada 80
Qatar 80
Iceland 79
Brazil 78
Czechia 78
Finland 78
Greece 78
Iraq 78
Israel 78
Portugal 78
Slovenia 78
Egypt 77
Estonia 77
India 77
Ireland 77
Kuwait 77
Philippines 77
Poland 77
Romania 77
Saudi Arabia 77
Indonesia 76
Lebanon 76
San Marino 76
Thailand 76
Chile 75
Pakistan 75
Luxembourg 74
Peru 74
Russia 74
Ecuador 73
Mexico 73
Slovakia 73
South Africa 73
United Arab Emirates 73
Armenia 72
Colombia 72
Croatia 72
Panama 72
Serbia 72
Taiwan* 72
Turkey 72
Argentina 71
Bulgaria 71
Latvia 71
Uruguay 71
Algeria 70
Costa Rica 70
Dominican Republic 70
Hungary 70
Andorra 69
Bosnia and Herzegovina 69
Jordan 69
Lithuania 69
Morocco 69
New Zealand 69
North Macedonia 69
Vietnam 69
Albania 68
Cyprus 68
Malta 68
Moldova 68
Brunei 67
Burkina Faso 67
Sri Lanka 67
Tunisia 67
Ukraine 66
Azerbaijan 65
Ghana 65
Kazakhstan 65
Oman 65
Senegal 65
Venezuela 65
Afghanistan 64
Cote d’Ivoire 64
Cuba 63
Mauritius 63
Uzbekistan 63
Cambodia 62
Cameroon 62
Honduras 62
Nigeria 62
West Bank and Gaza 62
Belarus 61
Georgia 61
Bolivia 60
Kosovo 60
Kyrgyzstan 60
Montenegro 60
Congo (Kinshasa) 59
Kenya 58
Niger 57
Guinea 56
Rwanda 56
Trinidad and Tobago 56
Paraguay 55
Bangladesh 54
Djibouti 52
El Salvador 51
Guatemala 50
Madagascar 49
Mali 48
Congo (Brazzaville) 45
Jamaica 45
Gabon 43
Somalia 43
Tanzania 43
Ethiopia 42
Burma 41
Sudan 40
Liberia 39
Maldives 37
Equatorial Guinea 36
Cabo Verde 34
Sierra Leone 32
Guinea-Bissau 31
Togo 31
Zambia 30
Eswatini 29
Chad 28
Tajikistan 27
Haiti 25
Sao Tome and Principe 25
Benin 23
Nepal 23
Uganda 23
Central African Republic 22
South Sudan 22
Guyana 20
Mozambique 19
Yemen 15
Mongolia 14
Mauritania 11
Nicaragua 11
Malawi 5
Syria 5
Zimbabwe 3
Bahamas 2
Libya 2
# Pick top 15 countries with data
max_colors<-12
# find a way to fix this- China has different provinces. Plot doesn't look right...
sufficient_data<-arrange(filter(N,!Country.Region %in% c("US_state", "Diamond Princess")),-n)[1:max_colors,]
kable(sufficient_data,caption = paste0("Top ",max_colors," countries with sufficient data"))
Top 12 countries with sufficient data
Country.Region n
China 129
Korea, South 100
Japan 99
Italy 97
Iran 94
Singapore 91
France 90
Germany 90
Spain 89
US 88
Switzerland 86
United Kingdom 86
Corona_Cases.world<-filter(Corona_Cases,Country.Region %in% c(sufficient_data$Country.Region))


  #us 
  #    - by state
Corona_Cases.US<-filter(Corona_Cases,Country.Region=="US" & Total_confirmed_cases>0)
# summarize 
#!City %in% c("Unassigned") 
  #    - specific cities
#mortality_rate!=Inf & mortality_rate<=1
head(Corona_Cases)
##   Country.Region Province.State City       Date Date.numeric
## 1    Afghanistan           <NA> <NA> 2020-03-31        18352
## 2    Afghanistan           <NA> <NA> 2020-04-25        18377
## 3    Afghanistan           <NA> <NA> 2020-04-12        18364
## 4    Afghanistan           <NA> <NA> 2020-01-28        18289
## 5    Afghanistan           <NA> <NA> 2020-04-01        18353
## 6    Afghanistan           <NA> <NA> 2020-01-27        18288
##   Total_confirmed_deaths Total_confirmed_cases mortality_rate
## 1                      4                   174     0.02298851
## 2                     47                  1463     0.03212577
## 3                     18                   607     0.02965404
## 4                      0                     0            NaN
## 5                      4                   237     0.01687764
## 6                      0                     0            NaN
##   Total_confirmed_cases.log Total_confirmed_deaths.log case100_date
## 1                  2.240549                   0.602060        18348
## 2                  3.165244                   1.672098        18348
## 3                  2.783189                   1.255273        18348
## 4                      -Inf                       -Inf        18348
## 5                  2.374748                   0.602060        18348
## 6                      -Inf                       -Inf        18348
##   Days_since_100 Lat Long Population Total_confirmed_cases.per100
## 1              4  NA   NA         NA                           NA
## 2             29  NA   NA         NA                           NA
## 3             16  NA   NA         NA                           NA
## 4            -59  NA   NA         NA                           NA
## 5              5  NA   NA         NA                           NA
## 6            -60  NA   NA         NA                           NA
##   Total_confirmed_deaths.per100
## 1                            NA
## 2                            NA
## 3                            NA
## 4                            NA
## 5                            NA
## 6                            NA
Corona_Cases[!is.na(Corona_Cases$Province.State) & Corona_Cases$Province.State=="Pennsylvania" & Corona_Cases$City=="Delaware","City"]<-"Delaware_PA"


Corona_Cases.UScity<-filter(Corona_Cases,Province.State %in% c("Pennsylvania","Maryland","New York","New Jersey") & City %in% c("Bucks","Baltimore City", "New York","Burlington","Cape May","Delaware_PA"))


measure_vars_long<-c("Total_confirmed_cases.log","Total_confirmed_cases","Total_confirmed_deaths","Total_confirmed_deaths.log")
melt_arg_list<-list(variable.name = "case_type",value.name = "cases",measure.vars = c("Total_confirmed_cases","Total_confirmed_deaths"))
melt_arg_list$data=NULL


melt_arg_list$data=select(Corona_Cases.world,-ends_with(match = "log"))
Corona_Cases.world.long<-do.call(melt,melt_arg_list)
melt_arg_list$data=select(Corona_Cases.UScity,-ends_with(match = "log"))
Corona_Cases.UScity.long<-do.call(melt,melt_arg_list)
melt_arg_list$data=select(Corona_Cases.US_state,-ends_with(match = "log"))
Corona_Cases.US_state.long<-do.call(melt,melt_arg_list)

Corona_Cases.world.long$cases.log<-log(Corona_Cases.world.long$cases,10)
Corona_Cases.US_state.long$cases.log<-log(Corona_Cases.US_state.long$cases,10)
Corona_Cases.UScity.long$cases.log<-log(Corona_Cases.UScity.long$cases,10)


# what is the current death and total case load for US? For world? For states?
#-absolute
#-log

# what is mortality rate (US, world)
#-absolute

#how is death and case correlated? (US, world)
#-absolute
#Corona_Cases.US<-filter(Corona_Cases,Country.Region=="US" & Total_confirmed_cases>0)
#Corona_Cases.US.case100<-filter(Corona_Cases.US, Days_since_100>=0)
# linear model parameters
#(model_fit<-lm(formula = Total_confirmed_cases.log~Days_since_100,data= Corona_Cases.US.case100 ))

#(slope<-model_fit$coefficients[2])
#(intercept<-model_fit$coefficients[1])

# Correlation coefficient
#cor(x = Corona_Cases.US.case100$Days_since_100,y = Corona_Cases.US.case100$Total_confirmed_cases.log)
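Uncommented into a runnable sketch, the fit described by the comments above would look like this (assuming Corona_Cases.US from the earlier filter step):

```r
# Fit a log-linear model to US cases, anchored at the 100th case
Corona_Cases.US.case100 <- filter(Corona_Cases.US, Days_since_100 >= 0)
model_fit <- lm(Total_confirmed_cases.log ~ Days_since_100, data = Corona_Cases.US.case100)
slope <- model_fit$coefficients[2]
intercept <- model_fit$coefficients[1]
cor(x = Corona_Cases.US.case100$Days_since_100,
    y = Corona_Cases.US.case100$Total_confirmed_cases.log)
# the fitted parameters can then feed prediction_model() for forward estimates
prediction_model(m = slope, b = intercept, days = 0:30)
```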

##------------------------------------------
## Plot World Data
##------------------------------------------
# Timestamp for world
timestamp_plot.world<-paste("Most recent date for which data available:",max(Corona_Cases.world$Date))#timestamp(quiet = T,prefix = "Updated ",suffix = " (EST)")


# Base template for plots
baseplot.world<-ggplot(data=NULL,aes(x=Days_since_100,col=Country.Region))+
  default_theme+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))


##/////////////////////////
### Plot Longitudinal cases

(Corona_Cases.world.long.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world.long,aes(y=cases))+
    geom_line(data=Corona_Cases.world.long,aes(y=cases))+
    facet_wrap(~case_type,scales = "free_y",ncol=1)+
    ggtitle(timestamp_plot.world)
    )

(Corona_Cases.world.loglong.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world.long,aes(y=cases.log))+
    geom_line(data=Corona_Cases.world.long,aes(y=cases.log))+
    facet_wrap(~case_type,scales = "free_y",ncol=1)+
    ggtitle(timestamp_plot.world))

##/////////////////////////
### Plot Longitudinal mortality rate

(Corona_Cases.world.mortality.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world,aes(y=mortality_rate))+
    geom_line(data=Corona_Cases.world,aes(y=mortality_rate))+
    ylim(c(0,0.3))+
    ggtitle(timestamp_plot.world))
## Warning: Removed 100 rows containing missing values (geom_point).
## Warning: Removed 100 row(s) containing missing values (geom_path).

##/////////////////////////
### Plot death vs total case correlation

(Corona_Cases.world.casecor.plot<-ggplot(Corona_Cases.world,aes(x=Total_confirmed_cases,y=Total_confirmed_deaths,col=Country.Region))+
  geom_point()+
  geom_line()+
  default_theme+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  theme(legend.position = "bottom",plot.title = element_text(size=12))+
    ggtitle(timestamp_plot.world))

### Write plots

write_plot(Corona_Cases.world.long.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.long.plot.png"
write_plot(Corona_Cases.world.loglong.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.loglong.plot.png"
write_plot(Corona_Cases.world.mortality.plot,wd = results_dir)
## Warning: Removed 100 rows containing missing values (geom_point).

## Warning: Removed 100 row(s) containing missing values (geom_path).
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.mortality.plot.png"
write_plot(Corona_Cases.world.casecor.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.casecor.plot.png"
##------------------------------------------
## Plot US State Data
##-----------------------------------------

baseplot.US<-ggplot(data=NULL,aes(x=Days_since_100_state,col=case_type))+
  default_theme+
  facet_wrap(~Province.State)+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))

Corona_Cases.US_state.long.plot<-baseplot.US+geom_point(data=Corona_Cases.US_state.long,aes(y=cases.log))
##------------------------------------------
## Plot US City Data
##-----------------------------------------

Corona_Cases.US.plotdata<-filter(Corona_Cases.US_state,Province.State %in% c("Pennsylvania","Maryland","New York","New Jersey") &
                                   City %in% c("Bucks","Baltimore City", "New York","Burlington","Cape May","Delaware_PA") &
                                   Total_confirmed_cases>0) 
timestamp_plot<-paste("Most recent date for which data available:",max(Corona_Cases.US.plotdata$Date))#timestamp(quiet = T,prefix = "Updated ",suffix = " (EST)")

city_colors<-c("Bucks"='#beaed4',"Baltimore City"='#386cb0', "New York"='#7fc97f',"Burlington"='#fdc086',"Cape May"="#e78ac3","Delaware_PA"="#b15928")

##/////////////////////////
### Plot log total and death cases over time

(Corona_Cases.city.loglong.plot<-ggplot(melt(Corona_Cases.US.plotdata,measure.vars = c("Total_confirmed_cases.log","Total_confirmed_deaths.log"),variable.name = "case_type",value.name = "cases"),aes(x=Date,y=cases,col=City,pch=case_type))+
  geom_point(size=4)+
    geom_line()+
  default_theme+
  #facet_wrap(~case_type)+
    ggtitle(paste("Log10 total and death cases over time,",timestamp_plot))+
theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
    scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.long.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(x=Date,y=Total_confirmed_cases,col=City))+
  geom_point(size=4)+
  geom_line()+
  default_theme+
  facet_grid(~Province.State,scales = "free_y")+
  ggtitle(paste("MD, PA, NJ total cases over time,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.mortality.plot<-ggplot(Corona_Cases.US.plotdata,aes(x=Date,y=mortality_rate,col=City))+
  geom_point(size=3)+
  geom_line(size=2)+
  default_theme+
  ggtitle(paste("Mortality rate (deaths/total) over time,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.casecor.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(y=Total_confirmed_deaths,x=Total_confirmed_cases,col=City))+
  geom_point(size=3)+
  geom_line(size=2)+
  default_theme+
  ggtitle(paste("Correlation of death vs total cases,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))+
  scale_color_manual(values = city_colors))

(Corona_Cases.city.long.normalized.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(x=Date,y=Total_confirmed_cases.per100,col=City))+
  geom_point(size=4)+
  geom_line()+
  default_theme+
  facet_grid(~Province.State)+
  ggtitle(paste("MD, PA, NJ total cases over time per 100 people,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)  +
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

write_plot(Corona_Cases.city.long.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.long.plot.png"
write_plot(Corona_Cases.city.loglong.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.loglong.plot.png"
write_plot(Corona_Cases.city.mortality.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.mortality.plot.png"
write_plot(Corona_Cases.city.casecor.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.casecor.plot.png"
write_plot(Corona_Cases.city.long.normalized.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.long.normalized.plot.png"

Q1b: What is the model?

Fit the cases to a linear model:

1. Find the time at which cases vs. date becomes linear in each state's plot
2. Fit a linear model for each state from that point on
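The onset-finding helper (`find_linear_index`) and `prediction_model` are defined upstream and not shown here; the two-step idea can be sketched roughly as follows. The function names, the `tol` threshold, and the column names are illustrative assumptions, not the exact upstream implementation:

```r
# Hypothetical sketch of the two-step fit (assumed names/threshold).
find_linear_onset <- function(df, tol = 1) {
  # Discrete second derivative of cumulative cases vs. time:
  # near zero once growth has become (locally) linear.
  d2 <- diff(df$Total_confirmed_cases_perstate, differences = 2)
  onset <- which(abs(d2) < tol)[1]
  # d2[i] spans points i..i+2, so the linear regime starts at Date[onset + 2]
  as.numeric(df$Date[onset + 2])
}

fit_from_onset <- function(df, onset) {
  # Fit cases ~ date only on the linear regime
  lm(Total_confirmed_cases_perstate ~ Date,
     data = df[as.numeric(df$Date) >= onset, ])
}
```

With perfectly linear toy data the second derivative is exactly zero, so the onset is the third point and the fitted slope recovers the daily increment.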

# What is the predicted # of cases for the next few days?
# How is the model performing historically?

Corona_Cases.US_state.summary<-ddply(Corona_Cases.US_state,
                                     c("Province.State","Date"),
                                     summarise,
                                     Total_confirmed_cases_perstate=sum(Total_confirmed_cases)) %>% 
    filter(Total_confirmed_cases_perstate>100)

# Compute the states with the most cases (for coloring and for linear model)
top_states_totals<-head(ddply(Corona_Cases.US_state.summary,c("Province.State"),summarise, Total_confirmed_cases_perstate.max=max(Total_confirmed_cases_perstate)) %>% arrange(-Total_confirmed_cases_perstate.max),n=max_colors)

kable(top_states_totals,caption = "Top 12 States, total count ")
top_states<-top_states_totals$Province.State

# Manually fix states so that Maryland is switched out for New York
top_states_modified<-c(top_states[top_states !="New York"],"Maryland")

# Plot with all states:
(Corona_Cases.US_state.summary.plot<-ggplot(Corona_Cases.US_state.summary,aes(x=Date,y=Total_confirmed_cases_perstate))+
  geom_point()+
  geom_point(data=filter(Corona_Cases.US_state.summary,Province.State %in% top_states),aes(col=Province.State))+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  default_theme+
  theme(axis.text.x = element_text(angle=45,hjust=1),legend.position = "bottom")+
  ggtitle("Total confirmed cases per state, top 12 colored")+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

##------------------------------------------
## Fit linear model to time vs total cases
##-----------------------------------------

# First, find the date at which each state's cases vs time becomes linear (2nd derivative is about 0)
li<-ddply(Corona_Cases.US_state.summary,c("Province.State"),find_linear_index)

# Compute linear model for each state starting at the point at which data becomes linear
for(i in 1:nrow(li)){
  Province.State.i<-li[i,"Province.State"]
  date.i<-li[i,"V1"]
  data.i<-filter(Corona_Cases.US_state.summary,Province.State==Province.State.i & as.numeric(Date) >= date.i)
  model_results<-lm(data.i,formula = Total_confirmed_cases_perstate~Date)
  slope<-model_results$coefficients[2]
  intercept<-model_results$coefficients[1]
  li[li$Province.State==Province.State.i,"m"]<-slope
  li[li$Province.State==Province.State.i,"b"]<-intercept
  }

# Compute top state case load with fitted model

(Corona_Cases.US_state.lm.plot<-ggplot(filter(Corona_Cases.US_state.summary,Province.State %in% top_states_modified ))+
    geom_abline(data=filter(li,Province.State %in% top_states_modified),
                aes(slope = m,intercept = b,col=Province.State),lty=2)+
    geom_point(aes(x=Date,y=Total_confirmed_cases_perstate,col=Province.State))+
    scale_color_brewer(type = "qualitative",palette = "Paired")+
    default_theme+
    theme(axis.text.x = element_text(angle=45,hjust=1),legend.position = "bottom")+
    ggtitle("Total confirmed cases per state, top 12 colored")+
    scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

##------------------------------------------
## Predict the number of total cases over the next week
##-----------------------------------------

predicted_days<-c(0,1,2,3,7)+as.numeric(as.Date("2020-04-20"))

# Seed with NULL so the rbind() below does not leave a leading all-NA row
predicted_days_df<-NULL

# Use model parameters to estimate case loads
for(state.i in top_states_modified){
  predicted_days_df<-rbind(predicted_days_df,
                           data.frame(Province.State=state.i,
                                      prediction_model(m = li[li$Province.State==state.i,"m"],
                                                       b =li[li$Province.State==state.i,"b"] ,
                                                       days =predicted_days )))
  }

predicted_days_df$Date<-as.Date(predicted_days_df$days,origin="1970-01-01")

kable(predicted_days_df,caption = "Predicted total cases over the next week for selected states")
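`prediction_model` is defined upstream; judging from how its output is row-bound and consumed above (a `days` column converted to dates, plus the per-state total), it presumably just evaluates the fitted line. A minimal sketch under that assumption, with a hypothetical name to avoid shadowing the real helper:

```r
# Assumed behavior of the upstream prediction_model(): evaluate
# y = m * day + b and return columns matching the rbind() above.
prediction_model_sketch <- function(m, b, days) {
  data.frame(days = days,
             Total_confirmed_cases_perstate = m * days + b)
}
```

Note the model is linear in *total* cases, so it only extrapolates sensibly once a state has entered its linear regime.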

##------------------------------------------
## Write plots
##-----------------------------------------

write_plot(Corona_Cases.US_state.summary.plot,wd = results_dir)
write_plot(Corona_Cases.US_state.lm.plot,wd = results_dir)

##------------------------------------------
## Write tables
##-----------------------------------------

write.csv(predicted_days_df,file = paste0(results_dir,"predicted_total_cases_days.csv"),quote = F,row.names = F)

Q2: What is the predicted number of cases?

What is the prediction of COVID-19 based on the model thus far? Additional questions:

Why did it take until day 40 for a log-linear trend to start? How long will it be until x number of cases? When will the plateau happen? Are any effects of social distancing (and their delays) noticeable?

##------------------------------------------
## Prediction and Prediction Accuracy
##------------------------------------------


today_num<-max(Corona_Cases.US$Days_since_100)
predicted_days<-today_num+c(1,2,3,7)

#mods = dlply(mydf, .(x3), lm, formula = y ~ x1 + x2)
#today:
Corona_Cases.US[Corona_Cases.US$Days_since_100==(today_num-1),]
Corona_Cases.US[Corona_Cases.US$Days_since_100==today_num,]
Corona_Cases.US$type<-"Historical"


#prediction_values<-prediction_model(m=slope,b=intercept,days = predicted_days)$Total_confirmed_cases

historical_model<-data.frame(date=today_num,m=slope,b=intercept)

# Toy data set to exercise train_lm() below
tmp<-data.frame(state=rep(c("A","B"),each=3),x=c(1,2,3,4,5,6))
tmp$y<-c(tmp[1:3,"x"]+5,tmp[4:6,"x"]*5+1)
lm(data =tmp,formula = y~x )

train_lm<-function(input_data,subset_column,formula_input){
  # Fit one lm per group, then collect the coefficients into a data frame
  case_models <- dlply(input_data, subset_column, lm, formula = formula_input)
  case_models.parameters <- ldply(case_models, coef)
  # plyr::rename() takes c(old = new); the slope column is named after the predictor
  replacements <- c("b", "m")
  names(replacements) <- c("(Intercept)", all.vars(formula_input)[2])
  case_models.parameters <- rename(case_models.parameters, replacements)
  return(case_models.parameters)
}

train_lm(tmp, "state", y ~ x)
 
# model for previous y days
#historical_model_predictions<-data.frame(day_x=NULL,Days_since_100=NULL,Total_confirmed_cases=NULL,Total_confirmed_cases.log=NULL)
# for(i in c(1,2,3,4,5,6,7,8,9,10)){
#   #i<-1
# day_x<-today_num-i # 1, 2, 3, 4
# day_x_nextweek<-day_x+c(1,2,3)
# model_fit_x<-lm(data = filter(Corona_Cases.US.case100,Days_since_100 < day_x),formula = Total_confirmed_cases.log~Days_since_100)
# prediction_day_x_nextweek<-prediction_model(m = model_fit_x$coefficients[2],b = model_fit_x$coefficients[1],days = day_x_nextweek)
# prediction_day_x_nextweek$type<-"Predicted"
# actual_day_x_nextweek<-filter(Corona_Cases.US,Days_since_100 %in% day_x_nextweek) %>% select(c(Days_since_100,Total_confirmed_cases,Total_confirmed_cases.log))
# actual_day_x_nextweek$type<-"Historical"
# historical_model_predictions.i<-data.frame(day_x=day_x,rbind(actual_day_x_nextweek,prediction_day_x_nextweek))
# historical_model_predictions<-rbind(historical_model_predictions.i,historical_model_predictions)
# }

#historical_model_predictions.withHx<-rbind.fill(historical_model_predictions,data.frame(Corona_Cases.US,type="Historical"))
#historical_model_predictions.withHx$Total_confirmed_cases.log2<-log(historical_model_predictions.withHx$Total_confirmed_cases,2)

(historical_model_predictions.plot<-ggplot(historical_model_predictions.withHx,aes(x=Days_since_100,y=Total_confirmed_cases.log,col=type))+
    geom_point(size=3)+
    default_theme+
    theme(legend.position = "bottom")+ 
      #geom_abline(slope = slope,intercept =intercept,lty=2)+
    #facet_wrap(~case_type,ncol=1)+
    scale_color_manual(values = c("Historical"="#377eb8","Predicted"="#e41a1c")))
write_plot(historical_model_predictions.plot,wd=results_dir)

Q3: What is the effect of social distancing and decreased mobility on case load?

Load data from Google, which computes the % change in user mobility relative to baseline for:

* Retail & recreation
* Grocery & pharmacy
* Parks
* Transit
* Workplace
* Residential

Data from https://www.google.com/covid19/mobility/

# See pre-processing section for script on gathering mobility data

# UNDER DEVELOPMENT

mobility<-read.csv("/Users/stevensmith/Projects/MIT_COVID19/mobility.csv",header = T,stringsAsFactors = F)
#mobility$Retail_Recreation<-as.numeric(sub(mobility$Retail_Recreation,pattern = "%",replacement = ""))
#mobility$Workplace<-as.numeric(sub(mobility$Workplace,pattern = "%",replacement = ""))
#mobility$Residential<-as.numeric(sub(mobility$Residential,pattern = "%",replacement = ""))

##------------------------------------------
## Show relationship between mobility and caseload
##------------------------------------------
mobility$County<-gsub(mobility$County,pattern = " County",replacement = "")
Corona_Cases.US_state.mobility<-merge(Corona_Cases.US_state,plyr::rename(mobility,c("State"="Province.State","County"="City")))

#Corona_Cases.US_state.tmp<-merge(metadata,Corona_Cases.US_state.tmp)
# Needs to happen upstream, see todos
#Corona_Cases.US_state.tmp$Total_confirmed_cases.perperson<-Corona_Cases.US_state.tmp$Total_confirmed_cases/as.numeric(Corona_Cases.US_state.tmp$Population)
mobility_measures<-c("Retail_Recreation","Grocery_Pharmacy","Parks","Transit","Workplace","Residential")

plot_data<-filter(Corona_Cases.US_state.mobility, Date.numeric==max(Corona_Cases.US_state$Date.numeric) ) %>% melt(measure.vars=mobility_measures) 
plot_data$value<-as.numeric(gsub(plot_data$value,pattern = "%",replacement = ""))
plot_data<-filter(plot_data,!is.na(value))

(mobility.plot<-ggplot(filter(plot_data,Province.State %in% c("Pennsylvania","Maryland","New Jersey","California","Delaware","Connecticut")),aes(y=Total_confirmed_cases.per100,x=value))+geom_point()+
  facet_grid(Province.State~variable,scales = "free")+
  xlab("Mobility change from baseline (%)")+
  ylab("Confirmed cases per 100 people (today)")+
  default_theme+
  ggtitle("Mobility change vs cases"))

(mobility.global.plot<-ggplot(plot_data,aes(y=Total_confirmed_cases.per100,x=value))+geom_point()+
  facet_wrap(~variable,scales = "free")+
  xlab("Mobility change from baseline (%)")+
  ylab("Confirmed cases per 100 people (today)")+
  default_theme+
  ggtitle("Mobility change vs cases"))

plot_data.permobility_summary<-ddply(plot_data,c("Province.State","variable"),summarise,cor=cor(y =Total_confirmed_cases.per100,x=value),median_change=median(x=value)) %>% arrange(-abs(cor))
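A caveat worth keeping in mind when reading this ranking: `cor()` is degenerate for very small groups. If a state contributes only two counties after filtering, the Pearson correlation is exactly +1 or -1 regardless of any real signal, which is likely why a few states sit at the top of the |cor| ranking (the values below are made up for illustration):

```r
# With n = 2 observations, Pearson correlation is always +1 or -1
# (whenever both variables vary), so two-county states dominate |cor|.
cor(c(-40, -39), c(0.1, 0.5))   # -> 1 (hypothetical mobility/case values)
```

NA rows (e.g. single-county states) arise for the same reason: with one observation, or zero variance, the correlation is undefined.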

kable(plot_data.permobility_summary,caption = "Ranked per-state mobility correlation with total confirmed cases")
Ranked per-state mobility correlation with total confirmed cases
Province.State variable cor median_change
Delaware Retail_Recreation 1.0000000 -39.5
Delaware Grocery_Pharmacy 1.0000000 -17.5
Delaware Parks -1.0000000 20.5
Delaware Transit 1.0000000 -37.0
Delaware Workplace 1.0000000 -37.0
Delaware Residential -1.0000000 14.0
Alaska Transit -1.0000000 -63.0
Hawaii Retail_Recreation 0.9927184 -56.0
Hawaii Grocery_Pharmacy 0.9685458 -34.0
New Hampshire Parks 0.9575772 -20.0
Maine Transit -0.9085243 -50.0
Connecticut Grocery_Pharmacy -0.8981020 -6.0
Alaska Residential 0.8713807 13.0
Utah Residential -0.8683132 12.0
South Dakota Parks 0.8584565 -26.0
Vermont Parks 0.8511470 -35.5
Alaska Grocery_Pharmacy -0.7967206 -7.0
Utah Transit -0.7958829 -18.0
Wyoming Parks -0.7916325 -4.0
Hawaii Residential -0.7879860 19.0
Massachusetts Workplace -0.7577465 -39.0
Connecticut Transit -0.7547646 -50.0
Rhode Island Workplace -0.7539745 -39.5
Wyoming Transit -0.7182777 -17.0
Alaska Workplace -0.6977774 -33.0
Hawaii Parks 0.6783816 -72.0
Utah Parks -0.6774183 17.0
Maine Workplace -0.6561298 -30.0
Arizona Grocery_Pharmacy -0.6532996 -15.0
New York Workplace -0.6479082 -34.5
Vermont Grocery_Pharmacy -0.6471188 -25.0
Rhode Island Retail_Recreation -0.6296900 -45.0
Utah Workplace -0.6294964 -37.0
Hawaii Transit 0.6156932 -89.0
Rhode Island Residential -0.6121146 18.5
New Jersey Parks -0.5991942 -6.0
New York Retail_Recreation -0.5876281 -46.0
New Jersey Workplace -0.5851051 -44.0
Nebraska Workplace 0.5841366 -32.0
Hawaii Workplace 0.5430439 -46.0
North Dakota Parks 0.5320297 -34.0
New York Parks 0.5273915 20.0
Connecticut Residential 0.5265597 14.0
North Dakota Retail_Recreation -0.5203813 -42.0
Arizona Retail_Recreation -0.5181071 -42.5
Maine Parks 0.5153281 -31.0
Connecticut Retail_Recreation -0.5109702 -45.0
Massachusetts Retail_Recreation -0.5101024 -44.0
Connecticut Workplace -0.4863477 -39.0
Montana Parks -0.4848389 -58.0
New Jersey Retail_Recreation -0.4837356 -62.5
New Jersey Grocery_Pharmacy -0.4835888 2.5
Nebraska Residential -0.4825942 14.0
West Virginia Parks 0.4804961 -33.0
Iowa Parks -0.4770625 28.5
New Mexico Grocery_Pharmacy -0.4742289 -11.0
Kansas Parks 0.4735686 72.0
Montana Residential 0.4710850 14.0
Montana Workplace -0.4691822 -40.5
Rhode Island Parks 0.4677024 52.0
Wyoming Workplace -0.4596745 -31.0
Arizona Residential 0.4519530 13.0
New Mexico Residential 0.4490680 13.5
New Mexico Parks 0.4437786 -31.5
Illinois Transit -0.4398974 -31.0
Vermont Residential 0.4363644 11.5
Pennsylvania Workplace -0.4323193 -36.0
South Carolina Workplace 0.4319011 -30.0
Kentucky Parks -0.4255129 28.5
New Jersey Transit -0.4202403 -50.5
Nevada Transit -0.4180360 -20.0
Massachusetts Grocery_Pharmacy -0.4126551 -7.0
California Residential 0.4113566 14.0
California Transit -0.4103473 -42.0
New Hampshire Residential -0.4099077 14.0
Maryland Workplace -0.4096073 -35.0
Arizona Transit 0.4065996 -38.0
Alabama Grocery_Pharmacy -0.4017725 -2.0
Maryland Grocery_Pharmacy -0.3980787 -10.0
Idaho Workplace -0.3897999 -29.5
Idaho Grocery_Pharmacy -0.3888434 -5.0
Wisconsin Transit -0.3866467 -23.5
Idaho Transit -0.3751680 -30.0
Pennsylvania Retail_Recreation -0.3736405 -45.0
New York Transit -0.3730669 -48.0
Alabama Workplace -0.3705570 -29.0
New Mexico Retail_Recreation -0.3499478 -42.5
Wyoming Grocery_Pharmacy -0.3495242 -10.0
Michigan Parks 0.3464100 30.0
Idaho Residential -0.3419564 11.0
Nebraska Grocery_Pharmacy 0.3395341 -0.5
California Parks -0.3292011 -38.5
North Dakota Workplace 0.3165174 -40.0
North Carolina Grocery_Pharmacy 0.3159278 0.0
Alaska Retail_Recreation 0.3157015 -39.0
Maine Retail_Recreation -0.3151257 -42.0
Minnesota Transit -0.3097998 -28.5
Alabama Transit -0.3090148 -36.5
Florida Residential 0.3063428 14.0
Vermont Retail_Recreation 0.3012336 -57.0
Virginia Transit -0.3005807 -33.0
Maryland Retail_Recreation -0.2929384 -39.0
Pennsylvania Parks 0.2898458 12.0
Colorado Residential 0.2870917 14.0
Arkansas Retail_Recreation -0.2810057 -30.0
North Carolina Workplace 0.2795454 -31.0
Illinois Workplace -0.2757704 -30.5
Mississippi Residential 0.2751636 13.0
Vermont Workplace -0.2702651 -43.0
Nevada Residential 0.2697468 17.0
Rhode Island Transit -0.2692428 -56.0
Idaho Retail_Recreation -0.2682283 -40.0
Oregon Grocery_Pharmacy 0.2681161 -7.0
Maryland Residential 0.2636790 15.0
Texas Residential -0.2621788 15.0
Kansas Workplace 0.2579015 -32.5
Texas Workplace 0.2574978 -32.0
Rhode Island Grocery_Pharmacy 0.2569036 -7.5
Tennessee Workplace -0.2505019 -31.0
Illinois Parks 0.2495385 26.5
Nevada Retail_Recreation -0.2463951 -43.0
Utah Retail_Recreation -0.2438855 -40.0
Wisconsin Parks 0.2428143 51.5
West Virginia Grocery_Pharmacy -0.2380990 -6.0
Florida Parks -0.2374027 -43.0
Tennessee Residential 0.2362973 11.5
Missouri Residential -0.2359610 13.0
Pennsylvania Grocery_Pharmacy -0.2337994 -6.0
Arkansas Residential 0.2333473 12.0
South Carolina Parks -0.2317263 -23.0
New York Grocery_Pharmacy -0.2312898 8.0
California Grocery_Pharmacy -0.2311417 -11.5
Texas Parks 0.2304409 -42.0
Georgia Grocery_Pharmacy -0.2287998 -10.0
Montana Retail_Recreation -0.2281961 -51.0
California Retail_Recreation -0.2224535 -44.0
Michigan Workplace -0.2129339 -40.0
Washington Workplace -0.2118822 -38.0
West Virginia Workplace 0.2088268 -33.0
North Carolina Transit 0.2079638 -32.0
North Carolina Residential 0.2024183 13.0
Mississippi Grocery_Pharmacy -0.2010488 -8.0
Kansas Grocery_Pharmacy -0.1991460 -14.0
Illinois Residential 0.1981369 14.0
New Jersey Residential 0.1917661 18.0
Georgia Workplace -0.1887871 -33.5
California Workplace -0.1871230 -36.0
North Dakota Grocery_Pharmacy -0.1867100 -8.0
Colorado Parks -0.1841075 2.0
Montana Grocery_Pharmacy -0.1840248 -16.0
Virginia Residential 0.1816840 14.0
Iowa Transit 0.1805088 -24.0
Georgia Retail_Recreation -0.1769072 -41.0
Oregon Residential 0.1694278 10.5
New Mexico Transit 0.1689997 -38.5
Wisconsin Residential -0.1686413 14.0
Missouri Workplace 0.1684975 -28.0
Connecticut Parks 0.1682501 43.0
Florida Retail_Recreation 0.1661505 -43.0
South Dakota Transit -0.1644080 -40.0
Ohio Transit 0.1602579 -28.0
Minnesota Parks 0.1564304 -9.0
Virginia Grocery_Pharmacy -0.1562518 -8.0
Pennsylvania Transit -0.1559877 -42.0
Washington Grocery_Pharmacy 0.1558173 -7.0
Washington Residential 0.1533082 13.0
Georgia Residential -0.1532067 13.0
Kentucky Grocery_Pharmacy 0.1531086 4.0
New Hampshire Retail_Recreation -0.1512776 -41.0
Indiana Retail_Recreation 0.1509680 -38.0
Indiana Residential 0.1457314 12.0
Massachusetts Parks 0.1435329 39.0
Alabama Parks 0.1423910 -1.0
Wisconsin Workplace -0.1419711 -31.0
Oklahoma Residential 0.1393227 15.0
Virginia Parks 0.1334361 6.0
North Dakota Transit 0.1331566 -48.0
Mississippi Transit -0.1324273 -38.5
Alabama Retail_Recreation 0.1310157 -39.0
South Dakota Residential 0.1306466 15.0
Michigan Retail_Recreation -0.1296980 -53.0
Oregon Retail_Recreation 0.1289041 -40.5
Massachusetts Transit -0.1262082 -45.0
Kentucky Transit -0.1209402 -31.0
Mississippi Retail_Recreation -0.1200679 -40.0
Ohio Parks -0.1169918 67.5
Texas Grocery_Pharmacy 0.1156069 -14.0
South Carolina Residential -0.1145974 12.0
Texas Transit 0.1144337 -41.0
New Hampshire Grocery_Pharmacy -0.1140246 -6.0
Minnesota Retail_Recreation 0.1138223 -40.0
South Dakota Retail_Recreation -0.1130686 -38.5
South Dakota Workplace 0.1116683 -35.0
Maryland Transit -0.1108680 -39.0
Nebraska Transit -0.1101000 -9.0
Wyoming Retail_Recreation -0.1082501 -39.0
Massachusetts Residential 0.1069516 15.0
North Carolina Parks -0.1056545 7.0
Arkansas Workplace -0.1053620 -26.0
Wyoming Residential 0.1052173 12.5
Nebraska Retail_Recreation 0.1051552 -36.0
Kansas Transit -0.1050548 -26.5
Washington Transit -0.1050436 -33.5
Minnesota Workplace -0.1045300 -33.0
Arizona Workplace -0.1030443 -35.0
Ohio Residential 0.1024956 14.0
Wisconsin Grocery_Pharmacy 0.1022606 -1.0
Maine Residential -0.1017816 11.0
Idaho Parks 0.1012973 -22.0
Indiana Parks -0.0995296 29.0
Oklahoma Parks -0.0987219 -18.5
Oregon Parks 0.0987043 16.5
Florida Workplace -0.0984273 -33.0
Arkansas Transit 0.0963279 -27.0
Georgia Parks 0.0961677 -6.0
Oklahoma Grocery_Pharmacy -0.0954410 -0.5
New York Residential 0.0928162 17.5
Nevada Parks 0.0905156 -12.5
Arkansas Parks -0.0903116 -12.0
Virginia Workplace -0.0901875 -31.5
Pennsylvania Residential 0.0883833 15.0
Indiana Workplace 0.0863303 -34.0
Missouri Transit -0.0837429 -24.5
Oregon Transit 0.0830139 -27.5
Tennessee Parks -0.0825233 10.5
Mississippi Workplace -0.0810152 -33.0
Virginia Retail_Recreation -0.0807355 -35.0
Michigan Residential 0.0782420 15.0
Kentucky Retail_Recreation 0.0765195 -29.0
Maine Grocery_Pharmacy -0.0763629 -13.0
South Carolina Transit 0.0759271 -45.0
North Dakota Residential -0.0751934 17.0
Oregon Workplace -0.0743095 -31.0
Kentucky Residential 0.0742810 12.0
Colorado Transit 0.0733622 -36.0
Indiana Grocery_Pharmacy -0.0731802 -5.5
Ohio Grocery_Pharmacy 0.0731550 0.0
West Virginia Residential -0.0726189 11.0
Michigan Transit 0.0724828 -46.0
Michigan Grocery_Pharmacy -0.0706962 -11.0
Washington Parks 0.0696346 -3.5
North Carolina Retail_Recreation 0.0683715 -34.0
Minnesota Residential -0.0646442 17.0
Minnesota Grocery_Pharmacy 0.0645655 -6.0
Arizona Parks -0.0594067 -44.5
Ohio Retail_Recreation 0.0585865 -36.0
South Carolina Retail_Recreation -0.0581319 -35.0
Oklahoma Workplace 0.0568721 -31.0
Florida Grocery_Pharmacy 0.0557136 -14.0
Kentucky Workplace -0.0551700 -36.0
South Carolina Grocery_Pharmacy 0.0543812 1.0
West Virginia Transit -0.0529547 -45.0
Missouri Parks 0.0527042 0.0
New Hampshire Transit -0.0505744 -57.0
Texas Retail_Recreation 0.0500502 -40.0
Washington Retail_Recreation -0.0465447 -42.0
Utah Grocery_Pharmacy 0.0460240 -4.0
Iowa Grocery_Pharmacy -0.0460194 4.0
Illinois Grocery_Pharmacy -0.0457732 2.0
New Hampshire Workplace 0.0450710 -37.0
Alabama Residential -0.0376847 11.0
South Dakota Grocery_Pharmacy 0.0374010 -9.0
Florida Transit -0.0373922 -49.0
Indiana Transit 0.0328631 -29.0
Colorado Grocery_Pharmacy -0.0319669 -17.0
Ohio Workplace -0.0312155 -35.0
Iowa Workplace 0.0309234 -30.0
Colorado Retail_Recreation -0.0307845 -44.0
Missouri Retail_Recreation -0.0306521 -36.5
Mississippi Parks -0.0297757 -25.0
Illinois Retail_Recreation 0.0242403 -40.0
Tennessee Grocery_Pharmacy 0.0238521 6.0
Nevada Workplace 0.0222558 -40.0
Kansas Residential -0.0213104 13.0
Iowa Retail_Recreation -0.0208197 -38.0
Nebraska Parks 0.0185586 55.5
New Mexico Workplace 0.0185014 -34.0
Montana Transit -0.0177719 -41.0
Iowa Residential -0.0157591 13.0
Kansas Retail_Recreation -0.0139608 -37.0
Oklahoma Retail_Recreation 0.0118993 -31.0
Nevada Grocery_Pharmacy 0.0109605 -12.5
Tennessee Transit -0.0083573 -32.0
Missouri Grocery_Pharmacy 0.0076220 2.0
West Virginia Retail_Recreation -0.0072694 -38.5
Tennessee Retail_Recreation -0.0072029 -30.0
Colorado Workplace 0.0067186 -39.0
Oklahoma Transit 0.0062270 -26.0
Maryland Parks -0.0056850 27.0
Vermont Transit -0.0046691 -63.0
Georgia Transit -0.0045906 -35.0
Wisconsin Retail_Recreation 0.0034416 -44.0
Arkansas Grocery_Pharmacy -0.0009703 3.0
Alaska Parks NA 29.0
District of Columbia Retail_Recreation NA -69.0
District of Columbia Grocery_Pharmacy NA -28.0
District of Columbia Parks NA -65.0
District of Columbia Transit NA -69.0
District of Columbia Workplace NA -48.0
District of Columbia Residential NA 17.0
# sanity check
ggplot(filter(plot_data,Province.State %in% c("Pennsylvania","Maryland","New Jersey","California","Delaware","Connecticut")),aes(x=Total_confirmed_cases.per100,fill=variable))+geom_histogram()+
  facet_grid(~Province.State)+
    default_theme+
  theme(legend.position = "bottom")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

write_plot(mobility.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/mobility.plot.png"
write_plot(mobility.global.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/mobility.global.plot.png"
(plot_data.permobility_summary.plot<-ggplot(plot_data.permobility_summary,aes(x=variable,y=median_change))+
  geom_jitter(size=2,width=.2)+
  #geom_jitter(data=plot_data.permobility_summary %>% arrange(-abs(median_change)) %>% head(n=15),aes(col=Province.State),size=2,width=.2)+
  default_theme+
  ggtitle("Per-State Median Change in Mobility")+
  xlab("Mobility Measure")+
  ylab("Median Change from Baseline"))

write_plot(plot_data.permobility_summary.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/plot_data.permobility_summary.plot.png"

DELIVERABLE MANIFEST

The following links point to committed documents pushed to GitHub. These are provided as a convenience, but note this is a manual process: the generation of reports, plots, and tables is not coupled to the execution of this markdown.

Report

This report, html & pdf

Plots

github_root<-"https://github.com/sbs87/coronavirus/blob/master/"

plot_handle<-c("Corona_Cases.world.long.plot",
               "Corona_Cases.world.loglong.plot",
               "Corona_Cases.world.mortality.plot",
               "Corona_Cases.world.casecor.plot",
               "Corona_Cases.city.long.plot",
               "Corona_Cases.city.loglong.plot",
               "Corona_Cases.city.mortality.plot",
               "Corona_Cases.city.casecor.plot",
               "Corona_Cases.city.long.normalized.plot",
               "Corona_Cases.US_state.lm.plot",
               "Corona_Cases.US_state.summary.plot")


deliverable_manifest<-data.frame(
  name=c("World total & death cases, longitudinal",
         "World log total & death cases, longitudinal",
         "World mortality",
         "World total & death cases, correlation",
         "City total & death cases, longitudinal",
         "City log total & death cases, longitudinal",
         "City mortality",
         "City total & death cases, correlation",
         "City population normalized total & death cases, longitudinal",
         "State total cases (select) with linear model, longitudinal",
         "State total cases, longitudinal"),
  plot_handle=plot_handle,
  link=paste0(github_root,"results/",plot_handle,".png")
)


(tmp<-data.frame(row_out=apply(deliverable_manifest,MARGIN = 1,FUN = function(x) paste(x[1],x[2],x[3],sep=" | "))))
##                                                                                                                                                                                                        row_out
## 1                                           World total & death cases, longitudinal | Corona_Cases.world.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.long.plot.png
## 2                                 World log total & death cases, longitudinal | Corona_Cases.world.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.loglong.plot.png
## 3                                                         World mortality | Corona_Cases.world.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.mortality.plot.png
## 4                                      World total & death cases, correlation | Corona_Cases.world.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.casecor.plot.png
## 5                                              City total & death cases, longitudinal | Corona_Cases.city.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.plot.png
## 6                                    City log total & death cases, longitudinal | Corona_Cases.city.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.loglong.plot.png
## 7                                                            City mortality | Corona_Cases.city.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.mortality.plot.png
## 8                                         City total & death cases, correlation | Corona_Cases.city.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.casecor.plot.png
## 9  City population normalized total & death cases, longitudinal | Corona_Cases.city.long.normalized.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.normalized.plot.png
## 10                     State total cases (select) with linear model, longitudinal | Corona_Cases.US_state.lm.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.lm.plot.png
## 11                                      State total cases, longitudinal | Corona_Cases.US_state.summary.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.summary.plot.png
row_out<-apply(tmp, 2, paste, collapse="\t\n")
| name | handle | link |
|---|---|---|
| World total & death cases, longitudinal | Corona_Cases.world.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.long.plot.png |
| World log total & death cases, longitudinal | Corona_Cases.world.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.loglong.plot.png |
| World mortality | Corona_Cases.world.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.mortality.plot.png |
| World total & death cases, correlation | Corona_Cases.world.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.casecor.plot.png |
| City total & death cases, longitudinal | Corona_Cases.city.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.plot.png |
| City log total & death cases, longitudinal | Corona_Cases.city.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.loglong.plot.png |
| City mortality | Corona_Cases.city.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.mortality.plot.png |
| City total & death cases, correlation | Corona_Cases.city.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.casecor.plot.png |
| City population normalized total & death cases, longitudinal | Corona_Cases.city.long.normalized.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.normalized.plot.png |
| State total cases (select) with linear model, longitudinal | Corona_Cases.US_state.lm.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.lm.plot.png |
| State total cases, longitudinal | Corona_Cases.US_state.summary.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.summary.plot.png |

Tables

CONCLUSION

Overall, the trend of COVID-19 cases is no longer in the log-linear phase for the world or the U.S. (though some regions, such as MD, are still in the log-linear phase). The mortality rate (deaths/confirmed RNA-based cases) is >1%, with the range depending on region. Mobility is not a strong indicator of caseload (U.S. data).

See table below for detailed breakdown.

| Question | Answer |
|---|---|
| What is the effect of social distancing and decreased mobility on case load? | There is no strong apparent effect of decreased mobility (work, grocery, retail) or increased mobility (at residence, parks) on the number of confirmed cases, at either the country (U.S.) or state level. California appears to have one of the better correlations, but the picture is mixed. |
| What is the trend in cases and mortality across geographical regions? | Confirmed total cases and mortality are overall log-linear for most countries, with a trailing off beginning for most (including the U.S.). At the state level, NY, NJ, and PA are starting to trail off; MD is still in the log-linear phase. Mortality and case load are highly correlated for NY, NJ, PA, and MD. The mortality rate fluctuates for a given region, but is about 3% overall. |
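The mortality rate quoted above is a simple case-fatality ratio: cumulative deaths divided by cumulative confirmed cases. A minimal sketch of that calculation (the column names and counts below are illustrative placeholders, not the analysis frames used elsewhere in this report):

```r
library(dplyr)

# Illustrative cumulative counts only; the real analysis uses frames
# such as Corona_Cases.world with confirmed-case and death columns
cases <- data.frame(
  region    = c("World", "US"),
  confirmed = c(1000, 400),
  deaths    = c(30, 12)
)

# Case-fatality ratio: deaths / confirmed RNA-based cases
mutate(cases, mortality = deaths / confirmed)
```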

END

End: ##—— Fri May 29 23:10:52 2020 ——##

Cheatsheet: http://rmarkdown.rstudio.com

Sandbox

# Geographical heatmap!
# install.packages("maps")
library(maps)
library(dplyr)   # for select()
library(ggplot2)
mi_counties <- map_data("county", "pennsylvania") %>% 
  select(lon = long, lat, group, id = subregion)
head(mi_counties)

ggplot(mi_counties, aes(lon, lat)) + 
  geom_point(size = .25, show.legend = FALSE) +
  coord_quickmap()
mi_counties$cases <- seq_len(nrow(mi_counties)) # placeholder counts, one per polygon vertex
name_overlaps(metadata,Corona_Cases.US_state)

tmp<-merge(Corona_Cases.US_state,metadata)
ggplot(filter(tmp,Province.State=="Pennsylvania"), aes(Long, Lat, group = as.factor(City))) +
  geom_polygon(aes(fill = Total_confirmed_cases), colour = "grey50") + 
  coord_quickmap()


ggplot(Corona_Cases.US_state, aes(Long, Lat))+
  geom_polygon(aes(fill = Total_confirmed_cases ), color = "white")+
  scale_fill_viridis_c(option = "C")
dev.off()


require(maps)
require(viridis)

world_map <- map_data("world")
ggplot(world_map, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill="lightgray", colour = "white")

head(world_map)
head(Corona_Cases.US_state)
unique(select(world_map, c("region", "group")))

some.eu.countries <- c(
  "US"
)
# Retrieve the map data
some.eu.maps <- map_data("world", region = some.eu.countries)

# Compute the centroid as the mean longitude and latitude
# Used as label coordinate for country's names
region.lab.data <- some.eu.maps %>%
  group_by(region) %>%
  summarise(long = mean(long), lat = mean(lat))

unique(filter(some.eu.maps,subregion %in% Corona_Cases.US_state$Province.State) %>% select(subregion))
unique(Corona_Cases.US_state$Total_confirmed_cases.log)
ggplot(filter(Corona_Cases.US_state, Date=="2020-04-17"), aes(x = Long, y = Lat)) +
  geom_polygon(aes( fill = Total_confirmed_cases.log))+
  #geom_text(aes(label = region), data = region.lab.data,  size = 3, hjust = 0.5)+
  #scale_fill_viridis_d()+
  #theme_void()+
  theme(legend.position = "none")
library("sf")
library("rnaturalearth")
library("rnaturalearthdata")

world <- ne_countries(scale = "medium", returnclass = "sf")
class(world)
ggplot(data = world) +
    geom_sf()

counties <- st_as_sf(map("county", plot = FALSE, fill = TRUE))
counties <- subset(counties, grepl("florida", counties$ID))
counties$area <- as.numeric(st_area(counties))
#install.packages("lwgeom")
class(counties)
head(counties)
ggplot(data = world) +
    # geom_sf(data = Corona_Cases.US_state) + # plain data.frame, not sf; convert with st_as_sf() first
    geom_sf(data = counties, aes(fill = area)) +
   # scale_fill_viridis_c(trans = "sqrt", alpha = .4) +
    coord_sf(xlim = c(-88, -78), ylim = c(24.5, 33), expand = FALSE)


head(counties)
tmp<-unique(select(filter(Corona_Cases.US_state,Date=="2020-04-17"),c(Lat,Long,Total_confirmed_cases.per100)))
st_as_sf(map("county", plot = FALSE, fill = TRUE))

# there is no 'join' package; sf objects support dplyr joins directly
dplyr::inner_join(Corona_Cases.US_state, counties)

library(sf)
library(sp)

nc <- st_read(system.file("shape/nc.shp", package="sf"))
class(nc)


spdf <- SpatialPointsDataFrame(coords = select(Corona_Cases.US_state,c("Lat","Long")), data = Corona_Cases.US_state,
                               proj4string = CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"))

head(spdf)
class(spdf)
st_as_sf(spdf) # st_cast() expects sf/sfc input; convert the sp object with st_as_sf()
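Going through sp's `SpatialPointsDataFrame` and then converting is unnecessary; sf can build point geometry from the Long/Lat columns in one call. A sketch with a dummy stand-in for `Corona_Cases.US_state` (column names assumed from the code above):

```r
library(sf)

# Dummy stand-in for Corona_Cases.US_state
cases <- data.frame(
  Province.State = c("Maryland", "New York"),
  Long = c(-76.6, -75.5),
  Lat  = c(39.0, 43.0),
  Total_confirmed_cases = c(100, 200)
)

# One call replaces SpatialPointsDataFrame + CRS setup + conversion
pts <- st_as_sf(cases, coords = c("Long", "Lat"), crs = 4326)
class(pts)
```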

filter(Corona_Cases.US_state.summary,Date=="2020-04-20" & Province.State %in% top_states_modified)

https://stevenbsmith.net